Supervised Evaluation of Dataset Partitions: Advantages and Practice

Authors

  • Sylvain Ferrandiz
  • Marc Boullé
Abstract

In the context of large databases, data preparation takes on a greater importance: instances and explanatory attributes have to be carefully selected. In supervised learning, instance partitioning techniques have been developed for univariate representations, leading to precise and comprehensible evaluations of the amount of information contained in an attribute with respect to the target attribute. The multivariate case, however, remains unaddressed. In this paper, we describe the intrinsic convenience of partitioning for data preparation and we set out a framework for supervised partitioning. A new evaluation criterion for partitions of labelled objects, based on the Minimum Description Length principle, is then defined and tested on real and synthetic data sets.

1 Supervised partitioning problems in data preparation

In a data mining project, the data preparation phase is a key one. Its main goal is to provide a clean and representative database for the subsequent modelling phase [3]. Typically, topics such as instance representation, instance selection and/or aggregation, missing-value handling and attribute selection have to be carefully dealt with. Among the many methods that have been designed, partition-based ones are often used for their ability to summarize information comprehensibly. The first examples that come to mind are clustering techniques, such as the most popular one, K-means [11], which aim at partitioning instances. Building hierarchies of partitions or mixture models is another way of performing unsupervised classification [5]. Combining clustering and attribute selection has led to the description of self-organizing feature maps [10]. In the supervised context, induction tree models are plainly partition-based [2], [12], [8]. These models build a hierarchy of instance groups, relying on the discriminating power of the explanatory attributes with respect to the categorical target attribute.
Like the naive Bayes classifier, they need to discretise the continuous explanatory attributes to make probability estimation more accurate. As discretisation is the typical univariate supervised partitioning problem, we now take a closer look at it. The objective of discretising a single continuous explanatory attribute is to find a partition of the values of this attribute that best discriminates the target distributions between groups. These groups are intervals, and the partition evaluation is based on a compromise: fewer intervals and stronger target discrimination are better. There are mainly two families of search algorithms: bottom-up greedy agglomerative heuristics and top-down greedy divisive ones. Discrimination can be evaluated in four ways, using a statistical test, entropy, description length or a Bayesian prior:
– ChiMerge [9] applies the chi-square measure to test the independence of the distributions between groups,
– C4.5 [12] uses Shannon-entropy-based information measures to find the most informative partition,
– MDLPC [6] defines a description length measure, following the Minimum Description Length principle [13],
– MODL [1] states a prior probability distribution, leading to a Bayesian evaluation of the partitions.
The discretisation problem illustrates the convenience of supervised partitioning methods for data preparation, since it addresses the following three problems simultaneously:
– Data representation: a suitable representation of the objects at hand has to be selected. Partitioning is an efficient means of evaluating the quality of representations (in the supervised context, statistical tests for class separability are another one, cf. [14]).
– Interpretability: labelled groups result from an understandable compromise between partition simplicity and target discrimination.
– Comparison capacity: the effects of explanatory attributes on the target can be quickly compared.
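The bottom-up family can be sketched as follows. This is a toy illustration that merges adjacent intervals greedily under a Shannon-entropy score; it is a stand-in for the cited criteria (ChiMerge, MDLPC, MODL), not an implementation of any of them, and the function names and the fixed interval budget are our own assumptions (the real criteria select the number of intervals automatically).

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a multiset of labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def greedy_discretise(pairs, max_intervals):
    """Bottom-up greedy discretisation: start with one interval per
    distinct value, then repeatedly merge the adjacent pair of
    intervals whose merge increases the total weighted entropy the
    least, until the interval budget is met."""
    pairs = sorted(pairs)  # (value, label) pairs, sorted by value
    intervals = []         # one (values, labels) entry per interval
    for v, l in pairs:
        if intervals and intervals[-1][0][-1] == v:
            intervals[-1][0].append(v)
            intervals[-1][1].append(l)
        else:
            intervals.append(([v], [l]))
    while len(intervals) > max_intervals:
        def merge_cost(i):
            a, b = intervals[i][1], intervals[i + 1][1]
            return (len(a + b) * entropy(a + b)
                    - len(a) * entropy(a) - len(b) * entropy(b))
        i = min(range(len(intervals) - 1), key=merge_cost)
        intervals[i] = (intervals[i][0] + intervals[i + 1][0],
                        intervals[i][1] + intervals[i + 1][1])
        del intervals[i + 1]
    # report each interval as (min value, max value, label counts)
    return [(vs[0], vs[-1], Counter(ls)) for vs, ls in intervals]

data = [(1, 'a'), (2, 'a'), (3, 'a'), (4, 'b'), (5, 'b'), (6, 'b')]
print(greedy_discretise(data, 2))  # cut falls between 3 and 4
```

On this toy data the zero-cost merges (pure intervals) are always preferred, so the single remaining boundary separates the two target distributions exactly.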
These themes are intertwined and play a crucial role in the data preparation phase (cf. Table 1 for an intuitive illustration in the multivariate case). The goal of this paper is to set a framework for supervised partitioning and to specify an evaluation criterion that preserves the interpretability bias and is not restricted to single continuous attributes. In the remainder of the paper, we first set out our framework and a description method for partitions (section 2). Then, we propose a new evaluation criterion (section 3) and test its validity on real and synthetic datasets (section 4). Finally, we conclude and point out future work (section 5).

                                        Labels distributions in groups
                                    Group 1         Group 2         Group 3
Explanatory attributes           Set. Ver. Vir.  Set. Ver. Vir.  Set. Ver. Vir.
Sepal width, sepal length,
petal width, petal length         50    0    0    0   50    0     0    0   50
Petal width, petal length         50    0    0    0   50    1     0    0   49
Sepal width, sepal length         50    2    1    0   48   49
Petal width                       50    0    0    0   48    0     0    2   50

Table 1. Examples of resulting partitions of Fisher's Iris database for different representation spaces. Partitioning techniques allow, among other things, the selection of an attribute subset to be carried out in an intelligible way, as the results are quickly interpretable and easily comparable. Here, we see that the three iris categories (Setosa, Versicolor and Virginica) are completely discriminated by the four attributes. However, one can consider petal width only. Furthermore, one can state that setosas distinguish themselves by their sepal width and length.

2 Graph constrained supervised partitioning

Let O = {o_1, ..., o_N} be a finite set of objects. A label l_n lying in an alphabet of size J is associated with each object o_n, and a graph structure G is set on O. This structure can be natural (road networks, web graphs, ...) or imposed (proximity graphs, partial orders, ...). In the remainder, we will suppose G undirected.
Our problem consists in finding an optimal partition of G, considering partitions composed of groups that are connected with respect to the discrete structure (i.e., connected partitions). As explained above, the optimality of a partition relies on the correct balance between the structure of its groups and its discriminating power (cf. Figure 1). Setting this balance requires the definition of description parameters both for the structure and for the target distribution.

Fig. 1. A two-class problem: which is the "best" partition?

Let π be a connected partition of G. We now introduce an effective and interpretable bias. We consider the balls induced by the discrete metric δ: δ(o_1, o_2) is the minimum number of edges needed to link o_1 and o_2. As illustrated by Figure 2, each group of π is then covered with δ-balls.

Fig. 2. Applying algorithm 1: description of a partition with non-intersecting balls (B(a, 2), B(b, 1), B(c, 1), B(d, 0)) defined by the graph distance.

The method consists in selecting non-intersecting balls that are included in a group of π. At each step, the biggest one is picked:
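The greedy ball selection just described can be sketched as follows. This is our own reconstruction from the description above and Figure 2 (the adjacency-list representation and the function names are assumptions): picking the biggest ball included in the still-uncovered part of a group guarantees the chosen balls are non-intersecting and included in that group.

```python
from collections import deque

def ball(adj, center, radius):
    """B(center, radius): the vertices within graph distance `radius`
    of `center`, computed by breadth-first search."""
    seen = {center}
    queue = deque([(center, 0)])
    while queue:
        v, d = queue.popleft()
        if d == radius:
            continue
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                queue.append((w, d + 1))
    return seen

def cover_group(adj, group):
    """Describe one group of the partition π: at each step, pick the
    biggest ball included in the uncovered part of the group, so the
    selected balls never intersect and never leave the group.
    Returns the (center, radius) pairs of the selected balls."""
    uncovered = set(group)
    chosen = []
    while uncovered:
        best = None  # (center, radius, ball) of the biggest ball so far
        for c in uncovered:
            r, b = 0, {c}
            while True:
                bigger = ball(adj, c, r + 1)
                if bigger == b or not bigger <= uncovered:
                    break  # ball stopped growing, or leaves the group
                r, b = r + 1, bigger
            if best is None or len(b) > len(best[2]):
                best = (c, r, b)
        chosen.append((best[0], best[1]))
        uncovered -= best[2]
    return chosen

# a path graph 1 - 2 - 3 - 4 - 5
adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
print(cover_group(adj, {1, 2, 3}))  # a single ball covers this group
```

The termination argument is the one implicit in the text: the ball B(c, 0) = {c} always fits, so every pass removes at least one vertex from the uncovered set.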


Publication date: 2005